Enabling Mixed-Precision Quantized Neural Networks in Extreme-Edge Devices
The deployment of Quantized Neural Networks (QNN) on advanced
microcontrollers requires optimized software to exploit digital signal
processing (DSP) extensions of modern instruction set architectures (ISA). As
such, recent research proposed optimized libraries for QNNs (from 8-bit to
2-bit) such as CMSIS-NN and PULP-NN. This work presents an extension to the
PULP-NN library targeting the acceleration of mixed-precision Deep Neural
Networks, an emerging paradigm able to significantly shrink the memory
footprint of deep neural networks with negligible accuracy loss. The library,
composed of 27 kernels, one for each permutation of input feature maps,
weights, and output feature maps precision (considering 8-bit, 4-bit and
2-bit), enables efficient inference of QNNs on parallel ultra-low-power (PULP)
clusters of RISC-V based processors, featuring the RV32IMCXpulpV2 ISA. The
proposed solution, benchmarked on an 8-core GAP-8 PULP cluster, reaches a peak
performance of 16 MACs/cycle on 8 cores, running 21x to 25x faster than an
STM32H7 (powered by an ARM Cortex M7 processor) with 15x to 21x better energy
efficiency.
Comment: 4 pages, 6 figures, published in the 17th ACM International Conference on
Computing Frontiers (CF '20), May 11-13, 2020, Catania, Italy.
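To make the 27-kernel structure concrete, here is a minimal Python sketch, not the PULP-NN C API: it enumerates the precision permutations and emulates one sub-byte dot product on packed 32-bit words. The packing scheme and function names are illustrative assumptions, standing in for the RV32 SIMD intrinsics the library actually uses.

    # Enumerate the 27 (input, weight, output) precision permutations and
    # emulate one sub-byte dot product; pure-Python stand-in, not RV32 SIMD.
    from itertools import product

    PRECISIONS = (8, 4, 2)  # bits per element, as in the abstract

    def kernel_variants():
        """All 27 precision combinations the library provides a kernel for."""
        return list(product(PRECISIONS, repeat=3))

    def unpack(word, bits):
        """Unpack a 32-bit word into unsigned sub-byte elements, LSB first."""
        mask = (1 << bits) - 1
        return [(word >> (i * bits)) & mask for i in range(32 // bits)]

    def dot(acts_word, wts_word, in_bits, w_bits):
        """One packed multiply-accumulate step (zip stops at the shorter)."""
        return sum(a * w for a, w in
                   zip(unpack(acts_word, in_bits), unpack(wts_word, w_bits)))

    assert len(kernel_variants()) == 27
    print(dot(0x04030201, 0xFF, 8, 2))  # -> (1 + 2 + 3 + 4) * 3 = 30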
DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs
The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme
edge of the Internet-of-Things is a critical enabler to support pervasive Deep
Learning-enhanced applications. Low-Cost MCU-based end-nodes have limited
on-chip memory and often replace caches with scratchpads, to reduce area
overheads and increase energy efficiency -- requiring explicit DMA-based memory
transfers between different levels of the memory hierarchy. Mapping modern DNNs
on these systems requires aggressive topology-dependent tiling and
double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY)
- an automatic tool to deploy DNNs on low-cost MCUs with typically less than
1 MB of on-chip SRAM. DORY abstracts tiling as a Constraint Programming
(CP) problem: it maximizes L1 memory utilization under the topological
constraints imposed by each DNN layer. Then, it generates ANSI C code to
orchestrate off- and on-chip transfers and computation phases. Furthermore, to
maximize speed, DORY augments the CP formulation with heuristics promoting
performance-effective tile sizes. As a case study for DORY, we target
GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power
MCU-class devices on the market. On this device, DORY achieves up to 2.5x
better MAC/cycle than the GreenWaves proprietary software solution and 18.1x
better than the state-of-the-art result on an STM32-F746 MCU on single layers.
Using our tool, GAP-8 can perform end-to-end inference of a 1.0-MobileNet-128
network consuming just 63 pJ/MAC on average @ 4.3 fps - 15.4x better than an
STM32-F746. We release all our developments - the DORY framework, the optimized
backend kernels, and the related heuristics - as open-source software.
Comment: 14 pages, 12 figures, 4 tables, 2 listings. Accepted for publication
in IEEE Transactions on Computers
(https://ieeexplore.ieee.org/document/9381618).
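A minimal sketch of the tiling idea, with a brute-force search standing in for DORY's Constraint Programming solver; the layer shape, L1 budget, and byte-level cost model below are assumed for illustration, not taken from the tool.

    # Maximize L1 utilization under a memory constraint (toy stand-in for CP).
    from itertools import product

    L1_BYTES = 64 * 1024  # assumed L1 scratchpad budget

    def tile_footprint(h, w, c_in, c_out, k=3, elem=1):
        """Bytes for one double-buffered (input, weight, output) tile set."""
        inp = (h + k - 1) * (w + k - 1) * c_in * elem   # input halo included
        wgt = k * k * c_in * c_out * elem
        out = h * w * c_out * elem
        return 2 * (inp + wgt + out)                    # x2: double buffering

    def best_tile(H, W, C_in, C_out):
        """Pick the tile shape that uses the most L1 without overflowing it."""
        best, best_used = None, -1
        for h, w, co in product(range(1, H + 1), range(1, W + 1),
                                range(1, C_out + 1)):
            used = tile_footprint(h, w, C_in, co)
            if used <= L1_BYTES and used > best_used:
                best, best_used = (h, w, co), used
        return best, best_used

    print(best_tile(32, 32, 16, 32))

A real CP solver also encodes the topological constraints each layer imposes (stride, padding, channel divisibility) and the performance heuristics the abstract mentions; this sketch keeps only the core objective.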
GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors
Bruschi, Nazareno; Haugou, Germain; Tagliavini, Giuseppe; Conti, Francesco; Benini, Luca; Rossi, Davide
A 3 TOPS/W RISC-V Parallel Cluster for Inference of Fine-Grain Mixed-Precision Quantized Neural Networks
The emerging trend of deploying complex algorithms, such as Deep Neural Networks (DNNs), increasingly poses strict memory and energy efficiency requirements on Internet-of-Things (IoT) end-nodes. Mixed-precision quantization has been proposed as a technique to minimize a DNN's memory footprint and maximize its execution efficiency, with negligible end-to-end precision degradation. In this work, we present a novel hardware and software stack for energy-efficient inference of mixed-precision Quantized Neural Networks (QNNs). We introduce Flex-V, a processor based on the RISC-V Instruction Set Architecture (ISA) that features fused Mac&Load mixed-precision dot-product instructions; to avoid the exponential growth of the encoding space due to mixed-precision variants, we encode the formats into the Control-Status Registers (CSRs). The Flex-V core is integrated into a tightly-coupled cluster of eight processors; in addition, we provide a full framework for the end-to-end deployment of DNNs, including a compiler, optimized libraries, and a memory-aware deployment flow. Our results show up to 91.5 MAC/cycle and 3.26 TOPS/W on the cluster, implemented in a commercial 22 nm FDX technology, with up to 8.5× speed-up and an area overhead of only 5.6% with respect to the baseline. To demonstrate the capabilities of the architecture, we benchmark it with end-to-end real-life QNNs, improving performance by 2× to 2.5× with respect to existing solutions based on fully flexible programmable processors.
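A toy Python model of the CSR-based format encoding the abstract describes: operand precisions are written once to a status register, so a single mac-and-load operation covers every mixed-precision variant without new opcodes. Register names and the operand pairing are illustrative assumptions, not the Flex-V ISA.

    # Toy model: precisions live in a control-status register, so one mac
    # instruction covers all variants; this is not the real Flex-V encoding.
    csr = {"mpc_format": (8, 8)}        # (activation bits, weight bits)

    def set_format(in_bits, w_bits):
        csr["mpc_format"] = (in_bits, w_bits)   # one CSR write per layer

    def mac_load(acc, acts_word, wts_word):
        in_bits, w_bits = csr["mpc_format"]     # read precision from the CSR,
        n = 32 // max(in_bits, w_bits)          # not from the opcode; pairing
        for i in range(n):                      # is simplified for the toy
            a = (acts_word >> (i * in_bits)) & ((1 << in_bits) - 1)
            w = (wts_word >> (i * w_bits)) & ((1 << w_bits) - 1)
            acc += a * w
        return acc

    set_format(4, 2)                 # e.g. 4-bit activations, 2-bit weights
    print(mac_load(0, 0x7531, 0b10011100))  # -> 1*0 + 3*3 + 5*1 + 7*2 = 28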
Scale up your In-Memory Accelerator: Leveraging Wireless-on-Chip Communication for AIMC-based CNN Inference
Analog In-Memory Computing (AIMC) is emerging as a disruptive paradigm for
heterogeneous computing, potentially delivering orders of magnitude better peak
performance and efficiency over traditional digital signal processing
architectures on Matrix-Vector multiplication. However, to sustain this
throughput in real-world applications, AIMC tiles must be supplied with data at
very high bandwidth and low latency; this puts unprecedented pressure on
the on-chip communication infrastructure, which becomes the system's
performance and efficiency bottleneck. In this context, the performance and
plasticity of emerging on-chip wireless communication paradigms provide the
required breakthrough to up-scale on-chip communication in large AIMC devices.
This work presents a many-tile AIMC architecture with inter-tile wireless
communication that integrates multiple heterogeneous computing clusters,
embedding a mix of parallel RISC-V cores and AIMC tiles. We perform an
extensive design space exploration of the proposed architecture and discuss the
benefits of exploiting emerging on-chip communication technologies such as
wireless transceivers in the millimeter-wave and terahertz bands.
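A back-of-envelope sketch of the bandwidth argument: a tile that consumes a whole input vector per matrix-vector product must have that vector streamed in on every operation, and the demand multiplies with the tile count. All numbers below are assumed for illustration, not taken from the paper.

    # Input-streaming bandwidth one AIMC tile demands, and how it scales
    # with the tile count; illustrative numbers only.
    def required_bandwidth_gbps(rows, bits_per_elem, ops_per_sec):
        """Gbit/s needed to feed one input vector per matrix-vector product."""
        return rows * bits_per_elem * ops_per_sec / 1e9

    tile_gbps = required_bandwidth_gbps(rows=1024, bits_per_elem=8,
                                        ops_per_sec=100e6)  # 100 M MVM/s
    print(f"one tile: {tile_gbps:8.1f} Gbit/s")
    print(f"64 tiles: {64 * tile_gbps:8.1f} Gbit/s")  # why the NoC saturates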